execution_mode = 'manual'
The first model to be trained with supervised learning is a Decision Tree Classifier, cf. [JudACaps]. This chapter shows the training and the performance measurement of two Ensemble models. First, a simple Decision Tree Classifier is trained without cross-validation; then the classifier is statistically hardened with the help of cross-validation. As a second Ensemble classifier, a Random Forests classifier will be trained and its performance tested.
As the first step, the data from previous chapters have to be read in as input for processing in this chapter.
import os
import pandas as pd
import bz2
import _pickle as cPickle
path_goldstandard = './daten_goldstandard'
# Restore results so far
df_labelled_feature_matrix = pd.read_pickle(os.path.join(path_goldstandard,
'labelled_feature_matrix.pkl'),
compression=None)
# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File((os.path.join(
path_goldstandard, 'labelled_feature_matrix_full.pkl')), 'rb') as file:
df_attribute_with_sim_feature = cPickle.load(file)
df_labelled_feature_matrix.head()
print('Part of duplicates (1) and uniques (0) in units of [%]')
print(round(df_labelled_feature_matrix.duplicates.value_counts(normalize=True)*100, 2))
The Decision Tree is the most basic algorithm in the family of Ensemble methods considered here. Its advantage is its clarity: a trained model can easily be interpreted by looking at its tree.
The train/test split has been implemented as a general function $\texttt{.split}\_\texttt{feature}\_\texttt{target}()$ in a separate library called classifier_fitting_funcs.py. The function uses the library function $\texttt{sklearn.model}\_\texttt{selection.train}\_\texttt{test}\_\texttt{split}()$ from scikit-learn with the parameter $\texttt{stratify}$, so that the distribution of the two classes in the split data matches the original distribution.
import classifier_fitting_funcs as cff
X_tr, X_val, X_te, y_tr, y_val, y_te, idx_tr, idx_val, idx_te = cff.split_feature_target(
df_labelled_feature_matrix, 'train_validation_test')
X_tr[:5], y_tr[:5], idx_tr[:5]
The train/test split is done twice. The first split generates an intermediate training set consisting of 80% of the full data and a test set consisting of the remaining 20%. The second split takes the intermediate training set as its basis and extracts 80% of it for training the model; the remaining 20% of the intermediate training set is used for validating the model during training. This strict separation of the data used for training from the data used for validation conforms to the basic principle of machine learning that any model is to be tested with unseen data. If this principle is violated and the test data is polluted with data the model has been in contact with during the training phase, the validation results risk being biased.
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
len(y_te[y_te==0]), len(y_te[y_te==1])))
Grid search is to be done with the Decision Tree Classifier. The goal is to find the best parameter set for the classifier. First, the parameter ranges, i.e. the grid points in the grid space, are defined. In the following code cell, a global parameter $\texttt{execution}\_\texttt{mode}$ controls the size of the grid. Several run modes of this notebook are foreseen. The global parameter is set in the very first code cell of this notebook and can be overwritten from outside when the notebook is called by Overview and Summary. When called from outside, a larger range of the grid space shall be executed with the goal of getting a systematic result of the calculations. The execution of the notebook in its local mode is meant to run quickly, just to get a basic idea of how the models behave.
if execution_mode == 'manual' :
depths = list(range(2, 30, 2)) # The number of features is 20.
depths.extend([35, 40, 50, None])
parameter_dictionary = {
'max_depth' : depths,
'criterion' : ['gini'],
'class_weight' : ['balanced']
}
elif execution_mode == 'full' :
# Find best parameters of Decision Tree
depths = list(range(4, 32, 2))
depths.extend([35, 40, 45, 50, None])
parameter_dictionary = {
'max_depth' : depths,
'criterion' : ['gini', 'entropy'],
'class_weight' : [None, 'balanced']
}
elif execution_mode == 'restricted' :
depths = list(range(16, 26, 2)) # The number of features is 20.
depths.extend([None])
parameter_dictionary = {
'max_depth' : depths,
'criterion' : ['gini', 'entropy'],
'class_weight' : [None, 'balanced']
}
elif execution_mode == 'tune' :
# Tune parameters of Decision Tree
depths = list(range(1, 31))
depths.extend([35, 40, 45, 50, None])
parameter_dictionary = {
'max_depth' : depths,
'criterion' : ['gini', 'entropy'],
'class_weight' : ['balanced']
}
# Grid of values
grid = cff.generate_parameter_grid(parameter_dictionary)
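The grid expansion performed by $\texttt{.generate}\_\texttt{parameter}\_\texttt{grid}()$ presumably corresponds to the Cartesian product of the parameter ranges, as provided by scikit-learn's $\texttt{ParameterGrid}$. A minimal sketch (the exact implementation in classifier_fitting_funcs.py may differ):

```python
from sklearn.model_selection import ParameterGrid

parameter_ranges = {
    'max_depth': [2, 4, None],
    'criterion': ['gini'],
    'class_weight': ['balanced']
}
# Every combination of the parameter ranges becomes one grid point
grid_demo = list(ParameterGrid(parameter_ranges))
len(grid_demo)  # 3 * 1 * 1 = 3 combinations
```

Each grid point is a plain dictionary that can be passed directly to an estimator's parameters.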
The Decision Tree Classifier is fitted with grid search with the help of a function $\texttt{.fit}\_\texttt{model}\_\texttt{measure}\_\texttt{scores}()$ implemented in library classifier_fitting_funcs.py. This function takes the model instance as parameter and returns the scores of the fitted model on the validation data.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=0)
# Save accuracy on test set
test_scores = []
for params_dict in grid :
test_scores.append(cff.fit_model_measure_scores(dt, params_dict, X_tr, y_tr, X_val, y_val))
# Save measured accuracies
df_test_scores_dt = pd.DataFrame(test_scores).sort_values('accuracy_val', ascending=False)
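A plausible sketch of what $\texttt{.fit}\_\texttt{model}\_\texttt{measure}\_\texttt{scores}()$ does: set the grid-point parameters on the estimator, fit it, and record the training and validation accuracies. The returned keys are assumptions based on the columns used later ($\texttt{accuracy}\_\texttt{tr}$, $\texttt{accuracy}\_\texttt{val}$, $\texttt{log}\_\texttt{accuracy}\_\texttt{val}$); the real library function may differ.

```python
import numpy as np

def fit_model_measure_scores(model, params_dict, X_tr, y_tr, X_val, y_val):
    # Configure the estimator with one grid point and fit it
    model.set_params(**params_dict)
    model.fit(X_tr, y_tr)
    scores = dict(params_dict)
    scores['accuracy_tr'] = model.score(X_tr, y_tr)
    scores['accuracy_val'] = model.score(X_val, y_val)
    # Logarithmic transform spreads accuracy values close to 1 apart
    scores['log_accuracy_val'] = -np.log(1 - scores['accuracy_val'] + 1e-12)
    return scores
```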
Naming the missing ($\texttt{None}$) entry of $\texttt{class}\_\texttt{weight}$ 'unbalanced' makes the test score data more readable.
ts_dict = {}
# kow = kind of weight
for kow in parameter_dictionary['class_weight']:
ts_dict['unbalanced' if kow is None else kow] = [
ts for ts in test_scores if ts['class_weight'] == kow]
Plotting the accuracy scores as a function of tree depth is a way of determining the best tree depth for a Decision Tree Classifier. Very often, the accuracy for the training data increases monotonically with increasing tree depth towards its maximum value. This monotonic increase of the accuracy score is a sign of overfitting to the training data. The accuracy scores calculated with the validation data are expected to show a different behaviour, though. Validating the trained model with the validation data very often shows a distinct maximum followed by a decrease of the accuracy score for higher values of tree depth. The tree depth at the maximum accuracy score on the validation data is interpreted as the best tree depth for the Decision Tree Classifier.
%matplotlib inline
import matplotlib.pyplot as plt
import results_analysis_funcs as raf
for kow in parameter_dictionary['class_weight'] :
kind_of_weight = 'unbalanced' if kow is None else kow
# Train data plot
plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_tr')
# Validation data plot
plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_val')
plt.ylabel('accuracy')
plt.title(f'Measured accuracy on {kind_of_weight} train and validation data')
plt.legend()
plt.show()
# Validation data plot
plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'log_accuracy_val')
plt.ylabel('log(1-accuracy)')
plt.title(f'Measured accuracy on {kind_of_weight} validation data')
plt.legend()
plt.show()
The observation above does not show the expected behaviour of the accuracy score on the validation data. Even on a logarithmic scale, the maximum remains constant as the tree depth increases further. In this situation, the smallest tree depth that reaches this maximum accuracy score is taken for the best Decision Tree Classifier model.
best_params = cff.get_best_parameters(test_scores, parameter_dictionary)
dt_best = DecisionTreeClassifier(criterion=best_params['criterion'],
max_depth=best_params['max_depth'],
class_weight=best_params['class_weight'], random_state=0)
dt_best.fit(X_tr, y_tr)
Having 20 features in the feature matrix, a maximum tree depth of 20 is a fair result. Let's have a look at the graph of the Decision Tree.
! pip install graphviz
path_tree_graphics = './documentation'
# Path for Decision Tree
decision_tree_dot = os.path.join(path_tree_graphics, 'decision_tree.dot')
decision_tree_png = os.path.join(path_tree_graphics, 'decision_tree.png')
from sklearn.tree import export_graphviz
# Export decision tree
dot_data = export_graphviz(
dt_best, out_file=decision_tree_dot,
feature_names=df_labelled_feature_matrix.drop(columns=['duplicates']).columns,
class_names=['unique', 'duplicate'],
filled=True, rounded=True, proportion=True
)
# Generate image in .png format
! dot -Tpng $decision_tree_dot -o $decision_tree_png
from IPython.display import Image
Image(decision_tree_png)
Counting the layers of the tree confirms its depth of 20.
The confusion matrix is used for testing the performance of the classifier [ConfMatr], see figure 1. In the confusion matrix, the records of class duplicate are the positive case, while the records of class unique are the negative case. The true negatives (uniques) and the true positives (duplicates) are the correctly classified predictions, where the notion "correct" refers to correct according to the classification of the provided test data set. The false negatives are the records that the model predicts as uniques but the reality of the test data classifies as duplicates. The false positives are the records that the model predicts as duplicates but the reality of the test data classifies as uniques.
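On a toy example, the four quadrants can be read off directly (scikit-learn returns the matrix with the true negatives in the upper left):

```python
from sklearn.metrics import confusion_matrix

# 0 = unique (negative case), 1 = duplicate (positive case)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```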
from sklearn.metrics import confusion_matrix
y_pred_dt = dt_best.predict(X_te)
confusion_matrix(y_te, y_pred_dt)
The specific numbers above depend on the parameters used for the model calculation and on the number of records used for training and testing.
The explicit assessment of the specific figures will be done in Overview and Summary depending on the specific parameters used for a run.
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
print('Score {:.3f}%'.format(100*dt_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
100*roc_auc_score(y_te, y_pred_dt),
100*accuracy_score(y_te, y_pred_dt),
100*precision_score(y_te, y_pred_dt),
100*recall_score(y_te, y_pred_dt)
))
The confusion matrix allows for calculating some characteristic numbers [ConfMatr].
The prediction $y_{pred}$ of a classifier for a record of the training or test data set is based on the prediction probability $y_{pred}^{probability}$, a tuple of two numbers in the closed interval from 0 to 1 whose elements sum to 1 $$y_{pred}^{probability} = (a, b) \texttt{ with } a, b \in [0, 1] \texttt{ and } a+b = 1.$$ Function $\texttt{.predict}()$ of the classifier uses a value of 0.5 to assign a record uniquely to either class. This value of 0.5 is the default threshold of the classifier. To get $y_{pred}^{probability}$, the model's function $\texttt{.predict}\_\texttt{proba}()$ can be called. With the resulting raw probability tuple, the threshold can be adjusted. The effect of varying the threshold is a shift in the allocation of records to the quadrants of the confusion matrix, which is equivalent to a change of the characteristic numbers. Modifying the threshold value allows for tuning a model with the goal of maximizing a desired characteristic number, for example the precision. The increase of one characteristic number will decrease other characteristic numbers like the accuracy, though.
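The effect of the threshold can be illustrated on a few hand-made probability tuples (illustrative values only, not model output):

```python
import numpy as np

# Each row is one probability tuple (a, b) with a + b = 1;
# column 1 holds the probability for the positive class (duplicate)
proba = np.array([[0.90, 0.10],
                  [0.30, 0.70],
                  [0.45, 0.55]])
y_default = (proba[:, 1] >= 0.5).astype(int)  # default threshold 0.5
y_strict = (proba[:, 1] >= 0.6).astype(int)   # stricter threshold
print(y_default)  # [0 1 1]
print(y_strict)   # [0 1 0]
```

The borderline record with probability 0.55 changes its predicted class when the threshold moves from 0.5 to 0.6, which is exactly the shift in the confusion matrix described above.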
y_proba = pd.Series(dt_best.predict_proba(X_te)[:,1])
y_proba[(y_proba>0) & (y_proba<1)] # Empty Series means no result
Unfortunately, the Decision Tree Classifier exclusively predicts probability tuples of kind $(0,1)$ for duplicates and $(1,0)$ for uniques. Therefore, the effect of changing the threshold cannot be illustrated with this classifier. This will be demonstrated below with the Random Forests classifier.
With the notion of the threshold, one more characteristic number can be explained. The roc auc (area under the receiver operating characteristic curve) is derived from a plot of the true positive rate $tpr$ versus the false positive rate $$fpr = \frac{fp}{fp+tn}$$ at various settings of the threshold [rocauc]. The value of the roc auc may vary between 0 and 1. If a classifier does not generate any relevant information, its value is 0.5. The closer the roc auc value is to 1, the better the prediction quality of the classifier.
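scikit-learn provides $\texttt{roc}\_\texttt{auc}\_\texttt{score}()$, which takes the raw probabilities of the positive class rather than hard predictions. A small example with made-up scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # probability of the positive class
print(roc_auc_score(y_true, y_score))  # 0.75
```

The value 0.75 reflects that three of the four (positive, negative) record pairs are ranked correctly by the scores.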
With these characteristic numbers derived from the confusion matrix, the prediction performance of a classifier is quantified. The comparison of the characteristic numbers of different classifiers will produce a ranking in chapter Overview and Summary. The ranking metric for assessing the overall best model of all calculated models remains the accuracy. If the accuracy happens to be equal for two different models, the roc auc will be considered as a second metric.
This kind of fine ranking with a second metric has to be pointed out. Within a model, the best classifier is ranked first using the accuracy score. When comparing and ranking the models among each other, the metric for assessing the rank remains the accuracy score. Adding the roc auc value for fine ranking brings metric numbers like precision and recall into additional consideration. This augmented information on the model's performance is the motivation for this kind of fine ranking. Unfortunately, there is no guarantee that the roc auc value, as a balanced mixture of several scoring values, holds the best possible value for the model with the best accuracy. This weakness is accepted for this capstone project, though.
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_attribute_with_sim_feature.columns)
df_attribute_with_sim_feature.iloc[idx_te].sort_index().sample(n=5)
In the confusion matrix, the false positives and the false negatives are the wrongly predicted records. One way of tuning a classifier may be to use different kinds of similarity metrics for an attribute. It is crucial to look at the wrongly predicted records to get an idea of the effect of the similarity metrics used. This analysis, with an improvement of the data records, has been done iteratively in the course of the capstone project. Some analysis will be illustrated in chapter Overview and Summary. To do so, all wrongly predicted records need to be stored in order to hand them over to the summary chapter. This is done with the help of a specific library function $\texttt{.add}\_\texttt{wrong}\_\texttt{predictions}()$.
import results_saving_funcs as rsf
idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred_dt)
wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
for i in wrong_prediction_groups :
rsf.add_wrong_predictions(path_goldstandard,
dt_best, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]])
The performance measurement described in this subsection will be repeated for all the models to come. For those models, the process will focus exclusively on the code and will leave out any additional description.
In order to reach a model with a strong statistical stability, cross-validation can be used when training the model. This section will use an object $\texttt{GridSearchCV}$ from scikit-learn for this purpose.
When doing cross-validation, the training data is split into training and validation data by the $\texttt{GridSearchCV}$ object from scikit-learn. Therefore, it is sufficient to split the original data into a train and a test data set without any additional splitting of the train data.
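A minimal, self-contained illustration of the $\texttt{GridSearchCV}$ pattern on a synthetic data set (not the project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid={'max_depth': [2, 4, None]},
                       cv=5)  # 5-fold cross-validation on the training data
gs_demo.fit(X_demo, y_demo)
gs_demo.best_params_['max_depth']  # one of the three candidate depths
```

Internally, $\texttt{GridSearchCV}$ refits each parameter combination on each of the five folds, so no manual validation split is needed.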
X_tr, _, X_te, y_tr, _, y_te, idx_tr, _, idx_te = cff.split_feature_target(
df_labelled_feature_matrix, 'train_test')
X_tr[:5], y_tr[:5], idx_tr[:5]
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
len(y_te[y_te==0]), len(y_te[y_te==1])))
The grid search for the Decision Tree classifier with cross-validation will be done with the same parameter space as for the Decision Tree classifier without cross-validation. In this way, the effect of cross-validation becomes obvious.
from sklearn.model_selection import GridSearchCV
import numpy as np
# Create cross-validation object with DecisionTreeClassifier
grid_cv = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=parameter_dictionary, cv=5,
                       verbose=1)
# Fit estimator
grid_cv.fit(X_tr, y_tr)
# Get the results with 'cv_results_', get parameters with their scores
params = pd.DataFrame(grid_cv.cv_results_['params'])
scores = pd.DataFrame(grid_cv.cv_results_['mean_test_score'], columns=['accuracy_val'])
log_scores = pd.DataFrame(-np.log(1-grid_cv.cv_results_['mean_test_score']), columns=['log_accuracy_val'])
scores_std = pd.DataFrame(grid_cv.cv_results_['std_test_score'], columns=['std_accuracy_val'])
# Create a DataFrame of (parameters, score, std) pairs
df_test_scores_dtcv = params.merge(scores, how='inner', left_index=True, right_index=True)
df_test_scores_dtcv = df_test_scores_dtcv.merge(
scores_std, how='inner', left_index=True, right_index=True).sort_values(
'accuracy_val', ascending=False)
df_test_scores_dtcv = df_test_scores_dtcv.merge(
log_scores, how='inner', left_index=True, right_index=True)
df_test_scores_dtcv.sort_values(by='accuracy_val', ascending=True)
The validation accuracy can be plotted as a function of the tree depth.
ts_dict = {}
# Reorder on index for x-axis
df_test_scores_dtcv.sort_index(inplace=True)
for kow in parameter_dictionary['class_weight']:
ts_dict['unbalanced' if kow is None else kow] = [
ts for ts in df_test_scores_dtcv.to_dict('records')
if ts['class_weight'] == kow]
for kow in parameter_dictionary['class_weight'] :
kind_of_weight = 'unbalanced' if kow is None else kow
# Validation data plot
plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_val')
plt.ylabel('accuracy')
plt.title(f'Measured accuracy on {kind_of_weight} validation data')
plt.legend()
plt.show()
# Validation data plot
plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'log_accuracy_val')
plt.ylabel('log(1-accuracy)')
plt.title(f'Measured accuracy on {kind_of_weight} validation data')
plt.legend()
plt.show()
For the $\texttt{GridSearchCV}$ object, the best estimator can be retrieved with the help of attribute $\texttt{best}\_\texttt{estimator}\_$. The parameters for the best estimator tree are shown below. They confirm the graphs above.
dtcv_best = grid_cv.best_estimator_
dtcv_best
Let's have a look at the tree of the best estimator.
# Path for Decision Tree
decision_tree_cv_dot = os.path.join(path_tree_graphics, 'decision_tree_cv.dot')
decision_tree_cv_png = os.path.join(path_tree_graphics, 'decision_tree_cv.png')
# Export decision tree
dot_data = export_graphviz(
dtcv_best, out_file=decision_tree_cv_dot,
feature_names=df_labelled_feature_matrix.drop(columns=['duplicates']).columns,
class_names=['unique', 'duplicate'],
filled=True, rounded=True, proportion=True
)
# Generate image in .png format
! dot -Tpng $decision_tree_cv_dot -o $decision_tree_cv_png
Image(decision_tree_cv_png)
The confusion matrix is used on the test data set for performance analysis, see subsection Performance Measurement of Decision Tree.
y_pred_dtcv = dtcv_best.predict(X_te)
confusion_matrix(y_te, y_pred_dtcv)
The scoring figures will be assessed in chapter Overview and Summary.
print('Score {:.3f}%'.format(100*dtcv_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
100*roc_auc_score(y_te, y_pred_dtcv),
100*accuracy_score(y_te, y_pred_dtcv),
100*precision_score(y_te, y_pred_dtcv),
100*recall_score(y_te, y_pred_dtcv)
))
The prediction probability tuples $y_{pred}^{probability}$ report values of $(1, 0)$ and $(0, 1)$ as seen in subsection Performance Measurement of Decision Tree.
y_proba = pd.Series(dtcv_best.predict_proba(X_te)[:,1])
y_proba[(y_proba>0) & (y_proba<1)] # Empty Series means no result
The last step of the performance measurement subsection is to persist the wrongly classified records for full assessment in chapter Overview and Summary.
idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred_dtcv)
wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
for i in wrong_prediction_groups :
rsf.add_wrong_predictions(path_goldstandard,
dtcv_best, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]], '_CV')
Another Ensemble method is Random Forests. The results of this classifier will be presented in this section.
The train/test split for Random Forests will be done the same way as for the Decision Tree classifier, with the goal of having three distinct data sets: one for training, one for validation and one for performance testing.
X_tr, X_val, X_te, y_tr, y_val, y_te, idx_tr, idx_val, idx_te = cff.split_feature_target(
df_labelled_feature_matrix, 'train_validation_test')
X_tr[:5], y_tr[:5], idx_tr[:5]
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
len(y_te[y_te==0]), len(y_te[y_te==1])))
The parameters for a Random Forests classifier differ from the parameters of the Decision Tree Classifier. This is due to the differences in the algorithms; see the scikit-learn documentation for details.
if execution_mode == 'manual' :
depths = [18, 20, 22]
depths.append(None)
parameter_dictionary = {
'n_estimators' : [50, 75, 100],
'max_depth' : depths,
'class_weight' : [None]
}
elif execution_mode == 'full' :
depths = list(range(10, 30, 2))
depths.append(None)
parameter_dictionary = {
'n_estimators' : [8, 16, 32, 64, 128],
'max_depth' : depths,
'class_weight' : [None, 'balanced']
}
elif execution_mode == 'restricted' :
depths = [18, 20, 22, 24]
depths.append(None)
parameter_dictionary = {
'n_estimators' : [128],
'max_depth' : depths,
'class_weight' : [None]
}
elif execution_mode == 'tune' :
# Tune random forest classifier
depths = list(range(16, 27))
parameter_dictionary = {
'n_estimators' : list(range(70, 125, 5)),
'max_depth' : depths,
'class_weight' : [None, 'balanced']
}
# Grid of values
grid = cff.generate_parameter_grid(parameter_dictionary)
from sklearn.ensemble import RandomForestClassifier
# Create random forest
rf = RandomForestClassifier(random_state=0) # Leave impurity measure on default value 'gini'
# Save accuracy on test set
test_scores = []
for params_dict in grid :
test_scores.append(cff.fit_model_measure_scores(rf, params_dict, X_tr, y_tr, X_val, y_val))
# Save measured accuracies
df_test_scores_rf = pd.DataFrame(test_scores).sort_values('accuracy_val', ascending=False)
The Random Forests parameters for the best model are shown below.
best_params = cff.get_best_parameters(test_scores, parameter_dictionary)
# Create a Random Forests classifier with the best parameters
rf_best = RandomForestClassifier(n_estimators=best_params['n_estimators'],
max_depth=best_params['max_depth'],
class_weight=best_params['class_weight'],
random_state=0
)
# Fit estimator
rf_best.fit(X_tr, y_tr)
The confusion matrix and the scoring values for the model are shown below.
y_pred_rf = rf_best.predict(X_te)
confusion_matrix(y_te, y_pred_rf)
print('Score {:.3f}%'.format(100*rf_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
100*roc_auc_score(y_te, y_pred_rf),
100*accuracy_score(y_te, y_pred_rf),
100*precision_score(y_te, y_pred_rf),
100*recall_score(y_te, y_pred_rf)
))
As described and expected in subsection Performance Measurement of Decision Tree, the prediction probability tuples $y_{pred}^{probability}$ report values within the open interval $(0, 1)$.
y_proba = pd.Series(rf_best.predict_proba(X_te)[:,1])
rf_best.predict_proba(X_te)[(y_proba>0) & (y_proba<1)]
Changing the threshold away from its default value results in modified values of the confusion matrix and of the scoring values, as described in subsection Performance Measurement of Decision Tree.
threshold = 0.1 # Modify threshold value (default is 0.5) => Tune model
y_pred_threshold = y_proba.apply(lambda x: 1.0 if x >= threshold else 0.0)
confusion_matrix(y_te, y_pred_threshold)
print('Original score with default threshold : {:.3f}% (see above)'.format(100*rf_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
100*roc_auc_score(y_te, y_pred_threshold),
100*accuracy_score(y_te, y_pred_threshold),
100*precision_score(y_te, y_pred_threshold),
100*recall_score(y_te, y_pred_threshold)
))
Finally, the wrongly predicted records for the Random Forests classifier need to be persisted for final assessment in the summary chapter. The prediction for the default threshold is taken.
idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred_rf)
wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']
for i in wrong_prediction_groups :
rsf.add_wrong_predictions(path_goldstandard,
rf_best, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]])
For Random Forests, an attribute is provided which returns an array indicating the importance of each feature: the higher the value, the more important the feature.
x_ticks = df_labelled_feature_matrix.drop(columns=['duplicates']).columns
plt.figure(figsize=(12,4))
plt.bar(x_ticks, rf_best.feature_importances_, color='red')
for i in range(len(x_ticks)):
plt.text(i-0.6, 2/10, f'{rf_best.feature_importances_[i]*100:.2f}%',
color='black', rotation=30, fontsize=13)
plt.xticks(rotation='vertical')
plt.title('Feature importance')
plt.xlabel('feature')
plt.ylabel('normed importance value')
plt.show()
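On a synthetic data set (using scikit-learn's $\texttt{make}\_\texttt{classification}$ rather than the project data), one can check that the values returned by $\texttt{feature}\_\texttt{importances}\_$ are normalized to sum to 1, which is why each value can be read as a percentage share as in the plot above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=2, random_state=0)
rf_demo = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
importances = rf_demo.feature_importances_
print(len(importances), round(importances.sum(), 6))  # 5 1.0
```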
The feature importance of an attribute is correlated with the degree of filling of this attribute, see chapter Data Analysis. Apart from that attribute property, its value gives insight into the role an attribute similarity plays for a record pair. It may differ for varying similarity metrics used for one and the same attribute. Therefore, the feature importance is an indicator for controlling the similarity metrics used.
This chapter has trained the first models for predicting the class of an unknown data set of test records. The calculated models belong to the family of Ensemble classifiers. The performance of each model has been measured, and the measurement procedure used repeatedly has been introduced and explained with the very first model, the Decision Tree Classifier. The models of this chapter will be compared with the results of the Dummy Classifier of chapter Features Discussion and Dummy Classifier Baseline and with all additional models to come. The assessment will be done in chapter Overview and Summary.
The final results will be assessed with the help of the same test data for all three models of this chapter. The train/test split need not be repeated here, as all train/test split calls of this chapter have generated the same test data set due to fixing $\texttt{random}\_\texttt{state}=0$. The results of this chapter still have to be persisted.
path_results = './results'
rsf.add_result_to_results(path_results,
df_test_scores_dt, dt_best, X_te, y_te, y_pred_dt)
rsf.add_result_to_results(path_results,
df_test_scores_dtcv, dtcv_best, X_te, y_te, y_pred_dtcv, '_CV')
rsf.add_result_to_results(path_results,
df_test_scores_rf, rf_best, X_te, y_te, y_pred_rf)